{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M6.03\n",
    "\n",
    "The aim of this exercise is to:\n",
    "\n",
    "* verifying if a random forest or a gradient-boosting decision tree overfit if\n",
    "  the number of estimators is not properly chosen;\n",
    "* use the early-stopping strategy to avoid adding unnecessary trees, to get\n",
    "  the best generalization performances.\n",
    "\n",
    "We use the California housing dataset to conduct our experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data, target = fetch_california_housing(return_X_y=True, as_frame=True)\n",
    "target *= 100  # rescale the target in k$\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0, test_size=0.5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a gradient boosting decision tree with `max_depth=5` and\n",
    "`learning_rate=0.5`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Also create a random forest with fully grown trees by setting `max_depth=None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "For both the gradient-boosting and random forest models, create a validation\n",
    "curve using the training set to assess the impact of the number of trees on\n",
    "the performance of each model. Evaluate the list of parameters `param_range =\n",
    "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n",
    "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n",
    "right sign of the Mean Absolute Error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Random forest models improve when increasing the number of trees in the\n",
    "ensemble. However, the scores reach a plateau where adding new trees just\n",
    "makes fitting and scoring slower.\n",
    "\n",
    "Now repeat the analysis for the gradient boosting model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gradient boosting models overfit when the number of trees is too large. To\n",
    "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
    "offers an early-stopping option. Internally, the algorithm uses an\n",
    "out-of-sample set to compute the generalization performance of the model at\n",
    "each addition of a tree. Thus, if the generalization performance is not\n",
    "improving for several iterations, it stops adding trees.\n",
    "\n",
    "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
    "of trees is certainly too large as we have seen above. Change the parameter\n",
    "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n",
    "5 trees to avoid deterioration of the overall generalization performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Estimate the generalization performance of this model again using the\n",
    "`sklearn.metrics.mean_absolute_error` metric but this time using the test set\n",
    "that we held out at the beginning of the notebook. Compare the resulting value\n",
    "with the values observed in the validation curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}